[HUDI-5533] Support spark columns comments #8683
base: master
Conversation
...datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/avro/SchemaConverters.scala
Scala Option does not have ofNullable (Java Optional does have). BTW
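A quick sketch of the point above (the nullable string is just an illustration): Scala's Option(...) plays the role of Java's Optional.ofNullable(...), returning None / empty for a null input.

val doc: String = null
val scalaOpt: Option[String] = Option(doc)                                    // None when doc is null
val javaOpt: java.util.Optional[String] = java.util.Optional.ofNullable(doc)  // Optional.empty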
case INT => avroSchema.getLogicalType match {
  case _: Date => SchemaType(DateType, nullable = false)
  case _ => SchemaType(IntegerType, nullable = false)
(avroSchema.getType, Option(avroSchema.getDoc)) match {
The conversion tool is copied from Spark: https://github.com/apache/spark/blob/dd4db21cb69a9a9c3715360673a76e6f150303d4/connector/avro/src/main/scala/org/apache/spark/sql/avro/SchemaConverters.scala#L58. I just noticed that Spark also does not support keeping comments from Avro fields while doing the conversion.
Spark likely has this limitation as well when retrieving a schema from Avro. But Spark doesn't usually infer its schema from Avro; Hudi does, and that's the reason for this patch.
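For illustration, a minimal sketch of the limitation, assuming the public spark-avro SchemaConverters API and a made-up record schema (this is not code from this PR):

import org.apache.avro.Schema
import org.apache.spark.sql.avro.SchemaConverters
import org.apache.spark.sql.types.StructType

// a made-up Avro record whose field carries a doc
val avro = new Schema.Parser().parse(
  """{"type":"record","name":"rec","fields":[
    |  {"name":"id","type":"int","doc":"primary key"}
    |]}""".stripMargin)

val struct = SchemaConverters.toSqlType(avro).dataType.asInstanceOf[StructType]
println(avro.getField("id").doc())   // "primary key"
println(struct("id").getComment())   // None: the stock conversion drops the doc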
Can you write a test case for it, especially for creating a table?
sure
@danny0405 added a test
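For reference, a hedged sketch of the kind of test discussed above (table name, properties, and assertions are illustrative, not the PR's actual test; it assumes a SparkSession with the Hudi SQL extensions enabled):

spark.sql(
  """create table hudi_comment_test (
    |  id int comment 'primary key',
    |  name string comment 'user name',
    |  ts bigint
    |) using hudi
    |tblproperties (primaryKey = 'id', preCombineField = 'ts')""".stripMargin)

// the column comments should show up when describing the table
val comments = spark.sql("describe table hudi_comment_test")
  .select("col_name", "comment")
  .collect()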
Kind of feel there is no need to change each match for every data type. Can we write another method, similar to toSqlTypeHelper, which invokes toSqlTypeHelper first and then fixes the comment separately?
"Kind of feel there is no need to change each match for every data type"
Any column, including nested columns, may have comments, so I don't see why we shouldn't look through all the Avro content for doc fields.
"which invokes toSqlTypeHelper first and then fixes the comment separately"
This would lead to walking through the Avro schema twice, and also to a complex merge of the results. Am I missing something?
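As an illustration of the single-traversal idea, a simplified sketch (not the PR's actual SchemaConverters change; it skips unions, logical types, nullability, etc.):

import scala.collection.JavaConverters._
import org.apache.avro.Schema
import org.apache.avro.Schema.Type._
import org.apache.spark.sql.types._

// convert one Avro field and attach its doc in the same pass,
// recursing into nested records so their comments are kept too
def toStructField(field: Schema.Field): StructField = {
  val dataType = field.schema().getType match {
    case INT     => IntegerType
    case LONG    => LongType
    case BOOLEAN => BooleanType
    case STRING  => StringType
    case RECORD  => StructType(field.schema().getFields.asScala.map(toStructField).toArray)
    case _       => StringType // simplified fallback
  }
  val base = StructField(field.name(), dataType, nullable = true)
  Option(field.doc()).map(base.withComment).getOrElse(base)
}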
…umns
# Conflicts:
# hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/AvroConversionUtils.scala
I have investigated a bit, and here is my current understanding: reading a Hudi table with Spark has two paths:
So right now, using this PR and setting both
To fix this we could:
I would go for
Thoughts @danny0405 @bhasudha @yihua ?
@danny0405 The PR is ready. The way to activate comments for all engines is as below:
To recap:
BTW, I will provide some details about comments in the doc if you want. The reason the Spark datasource sync should be disabled is that our current table property builder relies on Parquet types, which do not know about comments. It would be painful to modify, and the Spark table properties are useless and lead to errors.
@danny0405 just learned here why the Spark data source stuff shall be kept within HMS.
Then there is a need to update the datasource code to be aware of comments within Hive sync. @prashantwason can you confirm this part is still applicable and as a result
@danny0405 implemented Hive sync datasource, so to clarify what this PR brings:
#4960 provided:
#8740 provided:
Not covered yet:
Thanks @parisni, will find some time for the review ~
@@ -212,8 +213,9 @@ private void syncSchema(String tableName, boolean tableExists, boolean useRealTi
    Map<String, String> tableProperties = ConfigUtils.toMap(config.getString(ADB_SYNC_TABLE_PROPERTIES));
    Map<String, String> serdeProperties = ConfigUtils.toMap(config.getString(ADB_SYNC_SERDE_PROPERTIES));
    if (config.getBoolean(ADB_SYNC_SYNC_AS_SPARK_DATA_SOURCE_TABLE)) {
      List<FieldSchema> fromStorage = syncClient.getStorageFieldSchemas();
      Map<String, String> sparkTableProperties = SparkDataSourceTableUtils.getSparkTableProperties(config.getSplitStrings(META_SYNC_PARTITION_FIELDS),
fromStorage -> fieldSchema
@@ -65,7 +65,7 @@ public void runSQL(String s) {
    try {
      stmt = connection.createStatement();
      LOG.info("Executing SQL " + s);
      stmt.execute(s);
      stmt.execute(escapeAntiSlash(s));
Is the escape related to this change?
Yes, both JDBC and HMS generate SQL, and the backslash should be double-escaped, otherwise it is lost.
The backslash is used to escape double quotes in the comments DDL.
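A small Scala sketch of that double-escaping, mirroring what the escapeQuote / escapeAntiSlash helpers in this diff do (the comment value is illustrative):

import java.util.regex.Matcher

val comment = "size in \"bytes\""  // a column comment containing a double quote

// escape the quote for the DDL string: size in \"bytes\"
val quoted = comment.replaceAll("\"", Matcher.quoteReplacement("\\\""))

// double every backslash so it survives executors that unescape the statement: size in \\"bytes\\"
val escaped = quoted.replaceAll("\\\\", Matcher.quoteReplacement("\\\\"))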
/**
 * SQL statement should be escaped in order to consider anti-slash
 *
Every sentence should end with a period. Every new paragraph should start with <p>. Remove the param comments if there are no comments at all.
private static String getComment(String name, List<FieldSchema> fromStorage) {
  return fromStorage.stream()
      .filter(f -> name.equals(f.getName()))
      .filter(f -> f.getComment().isPresent())
So we only match the fields by name at the first level.
Yes, that's the current limitation of comment support: it's a flattened list of fields.
private static String escapeQuote(String s) {
  return s.replaceAll("\"", Matcher.quoteReplacement("\\\""));
}
Not sure why we need an escape.
See the UT. I added a double quote in the column comment, and for some Hive sync DDL executors, this escaping is needed.
@@ -133,7 +154,7 @@ private static String convertPrimitiveType(PrimitiveType field) {

private static String convertGroupField(GroupType field) {
  if (field.getOriginalType() == null) {
    return convertToSparkSchemaJson(field);
    return convertToSparkSchemaJson(field, Arrays.asList());
  }
Does this mean the composite data type is not supported?
Yes, currently in Hive sync we don't have the composite data type; the List<FieldSchema> is flat.
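To make that limitation concrete, a small sketch with a hypothetical stand-in for Hudi's FieldSchema (names and types are illustrative, not actual code from the sync module):

// each column is one flat (name, type, comment) entry
case class FieldSchema(name: String, hiveType: String, comment: Option[String])

val fields = Seq(
  FieldSchema("id", "int", Some("primary key")),
  // the struct is a single entry, so comments on address.city / address.zip cannot be carried
  FieldSchema("address", "struct<city:string,zip:string>", None)
)

// name-based, first-level-only lookup, like the getComment helper above
def getComment(name: String, fields: Seq[FieldSchema]): Option[String] =
  fields.find(_.name == name).flatMap(_.comment)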
      messageType);
      messageType,
      // flink does not support comment yet
      Arrays.asList());
Collections.emptyList() ?
Change Logs
Fixes #7531, i.e. show comments within Spark schemas.
Impact
Describe any public API or user-facing feature change or any performance impact.
Risk level (write none, low medium or high below)
None
Documentation Update
Describe any necessary documentation update if there is any new feature, config, or user-facing change
ticket number here and follow the instruction to make changes to the website.
Contributor's checklist